Abstract

Background: Hot spots are residues that contribute most of the binding free energy yet account for only a small
portion of a protein interface. Experimental approaches to identifying hot spots, such as alanine scanning mutagenesis,
are expensive and time-consuming, and computational methods are emerging as effective alternatives to
experimental approaches.

Results: In this study, we propose a semi-supervised boosting SVM, which is called sbSVM, to computationally 
predict hot spots at protein-protein interfaces by combining protein sequence and structure features. Here, feature 
selection is performed using random forests to avoid over-fitting. Due to the deficiency of positive samples, our 
approach samples useful unlabeled data iteratively to boost the performance of hot spots prediction. The 
performance evaluation of our method is carried out on a dataset generated from the ASEdb database for cross- 
validation and a dataset from the BID database for independent test. Furthermore, a balanced dataset with similar 
amounts of hot spots and non-hot spots (65 and 66 respectively) derived from the first training dataset is used to 
further validate our method. All results show that our method yields good sensitivity, accuracy and F1 score
compared with the existing methods.

Conclusion: Our method boosts prediction performance of hot spots by using unlabeled data to overcome the 
deficiency of available training data. Experimental results show that our approach is more effective than the 
traditional supervised algorithms and major existing hot spot prediction methods. 




Background 

Protein-protein interactions (PPIs) are critical for almost
all biological processes [1-3]. Many efforts have been
made to investigate the residues at protein-protein interfaces.
Examination of a large number of protein-protein
interaction interfaces has shown that no general
rules describe the interfaces precisely [4-10].
It is also well known that the binding free energy is not
uniformly distributed over protein interfaces; instead, a
small portion of interface residues contributes most of the
binding free energy [11]. These residues are
termed hot spots. Identifying hot spots and revealing
their mechanisms may offer promising prospects for
medicinal chemistry.

Alanine-scanning mutagenesis [12] is a popular
method to identify hot spots by evaluating the change in
binding free energy when interface residues are substituted
with alanine. Hot spots are defined as those sites where
alanine mutations cause a significant change in binding
free energy (ΔΔG). Owing to the high cost and low efficiency
of this traditional experimental method, public
databases of experimental results such as the Alanine
Scanning Energetics Database (ASEdb) [13] and the
Binding Interface Database (BID) [14] contain only a limited
number of complexes.

Owing to their critical role, several studies have focused on
the characteristics of hot spots. Studies on the composition of hot
spots and non-hot spots have revealed that Trp, Arg and
Tyr rank as the top three, with rates of 21%, 13.3% and
12.3% respectively, while Leu, Ser, Thr and Val are often
disfavored [15,16]. Furthermore, hot spots are found to
be more conserved than non-hot spots, and they are
usually surrounded by a group of residues that are not important
for binding, whose role is to shelter the hot spots from the
solvent [17].

Based on the existing studies on the characteristics of 
hot spots, some computational methods have been pro- 
posed to predict hot spots. These methods roughly fall 
into three categories: molecular dynamics (MD) simula- 
tions, energy-based methods and feature-based methods. 

Molecular dynamics (MD) [18-20] simulations simu- 
late alanine substitutions and estimate the correspond- 
ing changes in binding free energy. Although these 
molecular simulation methods have good performance 
on identifying hot spots from protein interfaces, they 
suffer from enormous computational cost. 

Energy-based methods use knowledge-based simplified
models to evaluate binding free energy for predicting hot
spots. Kortemme and Baker [21] proposed a simple physical
model using a free energy function to calculate the
binding free energy of alanine mutation in a protein-protein
complex. Guerois et al. [22] provided FOLDEF,
whose predictive power has been tested on a large set of
1088 mutants spanning most of the structural environments
found in proteins. Tuncbag et al. [23] established the
web server Hotpoint, which combines conservation, solvent
accessibility and statistical pairwise residue potentials to
predict hot spots computationally.

In recent years, some machine learning based methods
with a focus on feature selection have been developed to identify
hot spots. Ofran and Rost [24] proposed a sequence-based
neural network to predict hot spots. Darnell
et al. [25] provided the web server KFC, which uses decision
trees to predict hot spots. Other works use different features
as input to a Support Vector Machine (SVM) classifier.
Cho et al. [26] developed two
feature-based predictive SVM models for predicting
interaction hot spots. Xia et al. [27] introduced both an
SVM model and an ensemble classifier based on protrusion
index and solvent accessibility to boost hot spot
prediction accuracy. Zhu and Mitchell [28] developed a
new web server, named KFC2, by employing SVM with
some newly derived features.

Although machine learning based methods have
obtained relatively good performance on the prediction
of hot spots, some problems remain in
this area. Though many features have been generated
and used in previous studies, effective feature selection
methods and useful feature subsets have not yet been
found. Moreover, most of the existing methods use
very limited data from experimentally derived databases;
the training set is therefore insufficient, which leads to
unsatisfactory prediction performance.



To deal with the problems mentioned above, in this 
paper we first extract features of both sequence and 
structure, and employ random forests [29] to generate an 
effective feature subset. Then we propose a boosting 
SVM based approach, sbSVM, to improve the prediction 
of hot spots by using unlabeled data. Our method inte- 
grates unlabeled data into the training set to overcome 
the problem of labeled data inadequacy. Finally, we evalu- 
ate the proposed method by 10-fold cross-validation and 
independent test, which demonstrate the performance 
advantage of our approach over the existing methods. 

Methods 

Datasets 

The first training data set in this study, denoted as dataset1,
was extracted from ASEdb [13] and the data published
by Kortemme and Baker [21]. To eliminate redundancy,
we used the CATH (Class (C), Architecture (A),
Topology (T) and Homologous superfamily (H)) query
system with sequence identity less than 35% and
SSAP score less than or equal to 80. Details are listed
in Table 1. We define interface residues with ΔΔG ≥
2.0 kcal/mol as hot spots and those with ΔΔG < 2.0 kcal/mol
as non-hot spots [26,28,30].

As a result, dataset1 consists of 265 interface residues
derived from 17 protein-protein complexes, where 65
residues are hot spots and 200 residues are energetically
unimportant. In order to train better predictors,
we balanced the positive and negative samples as in [28].
The negative samples (non-hot spots) were divided into 3
groups and each was combined with the positive samples
(hot spots). The third group (66 non-hot spots) combined
with the 65 hot spots is denoted as dataset2; it
yields better results than the other two combinations
when used to train our predictor.
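
The balancing step can be sketched as follows. This is only a minimal sketch: the text does not say how the 200 non-hot spots were assigned to the three groups, so the contiguous split, the function name `split_into_groups` and the placeholder residue identifiers are assumptions made for illustration.

```python
def split_into_groups(items, k):
    """Split items into k contiguous groups whose sizes differ by at most one."""
    base, extra = divmod(len(items), k)
    groups, start = [], 0
    for g in range(k):
        size = base + (1 if g < extra else 0)  # the first `extra` groups get one more item
        groups.append(items[start:start + size])
        start += size
    return groups


# Placeholder identifiers standing in for the 65 hot spots and 200 non-hot spots.
positives = [f"hot_{i}" for i in range(65)]
negatives = [f"non_hot_{i}" for i in range(200)]

# Each candidate training set pairs all positives with one group of negatives.
negative_groups = split_into_groups(negatives, 3)
group_sizes = [len(group) for group in negative_groups]
balanced_sets = [positives + group for group in negative_groups]
```

Under this split the groups hold 67, 67 and 66 non-hot spots, so the third candidate set has 65 + 66 = 131 residues, which agrees with the size reported for the balanced dataset.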

An independent test dataset, denoted as ind-dataset,
was obtained from the BID database [14] to further evaluate
our method. In the BID database, alanine mutations
are listed as "strong", "intermediate", "weak"
or "insignificant". In this study, only residues with "strong"
mutations are considered hot spots and the others are
regarded as non-hot spots. As a result, ind-dataset consists
of 126 interface residues derived from 18 protein-protein
complexes, where 39 residues are hot spots and 87 residues
are energetically unimportant.

As a summary, the statistics of dataset1, dataset2 and
ind-dataset are presented in Table 2.

Features 

Based on previous studies on hot spots prediction, we 
generate 6 sequence features and 62 structure features. 
Sequence features 

The sequence features used in this paper include the
number of atoms, electron-ion interaction potential,
hydrophobicity, hydrophilicity, propensity and isoelectric
point. These physicochemical features can be obtained
from the AAindex database [31].

Table 1 The details of dataset1.
H stands for Hot Spot and NH stands for Non-Hot Spot. Dataset2 was derived from dataset1.

Structure features

First, we used PSAIA, the tool developed by
Mihel et al. [32], to generate features based on solvent
accessible surface area (ASA), relative solvent accessible
surface area (RASA), depth index (DI) and protrusion
index (PI), which are defined as follows:

♦ Accessible surface area (ASA, usually expressed in
Å²) is the atomic surface area of a molecule (protein,
DNA, etc.) that is accessible to a solvent.

♦ Relative ASA (RASA) is the ratio of the calculated
ASA to the reference ASA. The reference ASA
of a residue X is obtained from a Gly-X-Gly peptide in
extended conformation [33].

♦ Depth index (DI): the depth of an atom i ($DPX_i$) is
defined as the distance between atom i and the closest
solvent accessible atom j. That is,
$DPX_i = \min(d_1, d_2, d_3, \ldots, d_n)$, where
$d_1, d_2, d_3, \ldots, d_n$ are the distances
between atom i and all solvent accessible atoms.

♦ Protrusion index (PI) is defined as $V_{ext}/V_{int}$. Here,
$V_{int}$ is given by the number of atoms within the
sphere (with a fixed radius R) multiplied by the mean
atomic volume found in proteins; $V_{ext}$ is the difference
between the volume of the sphere and $V_{int}$,
which denotes the remaining volume of the sphere.

Table 2 Statistics of dataset1, dataset2 and ind-dataset.

Dataset       Number of hot spots   Number of non-hot spots   Total number
dataset1      65                    200                       265
dataset2      65                    66                        131
ind-dataset   39                    87                        126


From ASA and RASA, five attributes can be derived: 

♦ total (the sum of all atom values); 

♦ backbone (the sum of all backbone atom values); 

♦ side-chain (the sum of all side-chain atom values); 

♦ polar (the sum of all oxygen, nitrogen atom 
values); 

♦ non-polar (the sum of all carbon atom values). 

And based on DI and PI, four residue attributes can 
be obtained: 

♦ total mean (the mean value of all atom values); 

♦ side-chain mean (the mean value of all side-chain 
atom values); 

♦ maximum (the maximum of all atom values); 

♦ minimum (the minimum of all atom values). 

Therefore, 36 features were generated by PSAIA from 
unbound and bound states. 
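
The depth index, protrusion index and per-residue attributes defined above can be sketched in a few lines. This is an illustrative sketch rather than PSAIA's implementation: atoms are assumed to be given as 3D coordinate tuples, and the function names and the `mean_atomic_volume` argument are hypothetical.

```python
import math


def depth_index(atom, accessible_atoms):
    """Distance from one atom to the closest solvent accessible atom."""
    return min(math.dist(atom, other) for other in accessible_atoms)


def protrusion_index(n_atoms_in_sphere, radius, mean_atomic_volume):
    """V_ext / V_int for a sphere of the given radius centred on an atom."""
    v_sphere = 4.0 / 3.0 * math.pi * radius ** 3
    v_int = n_atoms_in_sphere * mean_atomic_volume  # volume occupied by atoms
    v_ext = v_sphere - v_int                        # remaining volume of the sphere
    return v_ext / v_int


def residue_attributes(atom_values, side_chain_values):
    """Total mean, side-chain mean, maximum and minimum over a residue's atoms.

    Assumes the residue has at least one side-chain atom.
    """
    return {
        "total_mean": sum(atom_values) / len(atom_values),
        "side_chain_mean": sum(side_chain_values) / len(side_chain_values),
        "maximum": max(atom_values),
        "minimum": min(atom_values),
    }
```

Applying `residue_attributes` to the per-atom DI or PI values of a residue gives the four attributes listed above; doing so for both the unbound and the bound structure gives the two states.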

In addition, the relative changes of ASA, DI and PI
between the unbound and bound states of the residues
were calculated as in Xia et al.'s work [27], and 13 more
features were generated by the equations below:

$RcASA = (ASA_{unbound} - ASA_{bound}) / ASA_{unbound}$,
$RcDI = (DI_{bound} - DI_{unbound}) / DI_{bound}$,
$RcPI = (PI_{unbound} - PI_{bound}) / PI_{unbound}$.
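
A minimal sketch of these relative changes follows, assuming the forms RcASA = (ASA_unbound − ASA_bound)/ASA_unbound, RcDI = (DI_bound − DI_unbound)/DI_bound and RcPI = (PI_unbound − PI_bound)/PI_unbound. The function names and the convention of returning 0.0 for a zero reference value are assumptions, not something the text specifies.

```python
def relative_change(reference, other):
    """(reference - other) / reference; 0.0 when the reference is zero (assumed convention)."""
    if reference == 0:
        return 0.0
    return (reference - other) / reference


def rc_asa(asa_unbound, asa_bound):
    # The unbound state is the reference for ASA.
    return relative_change(asa_unbound, asa_bound)


def rc_di(di_unbound, di_bound):
    # For the depth index the bound state is the reference.
    return relative_change(di_bound, di_unbound)


def rc_pi(pi_unbound, pi_bound):
    # The unbound state is the reference for the protrusion index.
    return relative_change(pi_unbound, pi_bound)
```

For example, a residue whose ASA drops from 100 Å² to 25 Å² upon complexation has RcASA = 0.75.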

Furthermore, we generated some useful features following
the strategy of KFC2 [28]. The residues' solvent
accessible surface area, which is used in the following features, is
calculated by NACCESS [34].

DELTA_TOT describes the difference between the solvent
accessible surface areas in the unbound and bound states:

$DELTA\_TOT = ASA_{unb} - ASA_{bnd}$.

SA_RATIO5 is the ratio of solvent accessible surface
area over maxASA, which stands for the residue's maximum
solvent accessible surface area as a tripeptide [35]:

$SA\_RATIO5 = \frac{DELTA\_TOT \times maxASA}{ASA_{unb}}$

Another form of the ratio of solvent accessible surface
area, CORE_RIM, is given by:

$CORE\_RIM = \frac{DELTA\_TOT}{ASA_{unb}}$

This feature is quite similar to the relative change in
total ASA described before. The main difference is
that PSAIA treats each chain separately during the calculation
[32]. In our work we use at most one of
these two features in order to avoid bias.

POS_PER is defined as below, where i is the sequence
number of the residue and N is the total number of the
interface residues:

$POS\_PER = CORE\_RIM \times \frac{i}{N}$.
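
The three ASA-derived features can be sketched as below, assuming DELTA_TOT = ASA_unb − ASA_bnd, CORE_RIM = DELTA_TOT / ASA_unb and POS_PER = CORE_RIM × i / N. The function names are hypothetical, and the sketch assumes a non-zero unbound ASA.

```python
def delta_tot(asa_unb, asa_bnd):
    """Solvent accessible surface area lost upon complexation."""
    return asa_unb - asa_bnd


def core_rim(asa_unb, asa_bnd):
    """Fraction of the unbound ASA that is buried in the complex."""
    return delta_tot(asa_unb, asa_bnd) / asa_unb


def pos_per(asa_unb, asa_bnd, i, n):
    """CORE_RIM scaled by the residue's sequence number i over the N interface residues."""
    return core_rim(asa_unb, asa_bnd) * i / n
```

For a residue with 120 Å² unbound and 30 Å² bound ASA, DELTA_TOT is 90 Å² and CORE_RIM is 0.75; with i = 5 and N = 20, POS_PER is 0.1875.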

ROT4 and ROT5 stand for the total numbers of
side-chain rotatable single bonds to the target residue for
the residues within 4.0 Å and 5.0 Å, respectively.

HP5 is the sum of the hydrophobicity values of all neighbors
of a residue within 5 Å.

FP9N, FP9E, FP10N and FP10E were directly calculated
by FADE [36], an efficient method for calculating
atomic density.

PLAST4 and PLAST5 were calculated as:

$PLAST4 = \frac{WT\_ROT4}{ATMN4 \times maxASA}$,

$PLAST5 = \frac{WT\_ROT5}{ATMN5 \times maxASA}$,

where WT_ROT4 and WT_ROT5 count the weighted rotatable
single bonds of a residue's side chain within 4 Å
and 5 Å respectively, and ATMN4 and ATMN5 indicate the
total numbers of surrounding atoms of a residue within
4 Å and 5 Å respectively.
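
As a consistency check on the feature counts, the arithmetic can be written out explicitly. The grouping of the 13 relative-change features as 5 ASA attributes plus 4 DI and 4 PI attributes is an inference from the attribute lists above, not something the text states directly.

```python
ASA_ATTRIBUTES = ["total", "backbone", "side-chain", "polar", "non-polar"]
DI_PI_ATTRIBUTES = ["total mean", "side-chain mean", "maximum", "minimum"]

# ASA and RASA each give 5 attributes; DI and PI each give 4.
psaia_per_state = 2 * len(ASA_ATTRIBUTES) + 2 * len(DI_PI_ATTRIBUTES)
psaia_features = 2 * psaia_per_state  # unbound and bound states

# Relative changes of ASA (5 attributes), DI (4) and PI (4); inferred grouping.
relative_change_features = len(ASA_ATTRIBUTES) + 2 * len(DI_PI_ATTRIBUTES)

KFC2_STYLE_FEATURES = [
    "DELTA_TOT", "SA_RATIO5", "CORE_RIM", "POS_PER", "ROT4", "ROT5", "HP5",
    "FP9N", "FP9E", "FP10N", "FP10E", "PLAST4", "PLAST5",
]

structure_features = psaia_features + relative_change_features + len(KFC2_STYLE_FEATURES)
sequence_features = 6
total_features = sequence_features + structure_features
```

The totals come to 36 + 13 + 13 = 62 structure features and 68 features overall, in agreement with the numbers given in the text.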

Feature selection 

Feature selection is an important step in training classifiers
and is often utilized to improve the performance of
a classifier by removing redundant and irrelevant
features.

In this work, 68 features were generated initially. Such a
large feature set may cause over-fitting of the model. Therefore,
we employed random forests, proposed by Breiman [29], to
find the important features that better discriminate
hot spot residues from non-hot spot residues.

Random forests are a combination of tree predictors 
such that each tree depends on the values of a random 
vector sampled independently and with the same distribu- 
tion for all trees in the forests. Random forests return sev- 
eral measures of variable importance. The most reliable 
measure is based on the decrease in classification accuracy 
when the values of a variable in a node of a tree are per- 
muted randomly [37]. 
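
The idea behind permutation-based importance can be illustrated with a model-agnostic sketch. It is not Breiman's out-of-bag procedure and not the implementation used in this work: `predict` is any hypothetical function that maps a feature row to a class label, and the importance of a feature is taken as the mean drop in accuracy after shuffling that feature's column.

```python
import random


def accuracy(predict, X, y):
    """Fraction of rows whose predicted label equals the true label."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Mean accuracy drop per feature when that feature's values are shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    importances = []
    for col in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)
            # Rebuild the rows with only this column permuted.
            X_perm = [row[:col] + [value] + row[col + 1:]
                      for row, value in zip(X, shuffled)]
            drops.append(baseline - accuracy(predict, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances
```

A feature that the classifier ignores receives an importance of exactly zero, whereas shuffling an informative feature lowers the accuracy and yields a positive value.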

Figure 1 shows the importance of all 68 features for
hot spot prediction on dataset1. We can clearly see
how each of the features affects the accuracy of prediction.
In our study, we selected the top-10 features,
whose importance values are significantly higher than
the others', and then tried various combinations to get
the best prediction result. The features that we chose
for dataset1 are: relative change in side-chain ASA upon
complexation, relative change in side-chain mean PI
upon complexation, CORE_RIM, SA_RATIO5, total
RASA, DELTA_TOT.

The feature importance for the balanced training data
set, dataset2, is illustrated in Figure 2. Here, we again tried
various combinations of the top-10 features. The features
we used in the prediction model for dataset2 are:
SA_RATIO5, relative change in side-chain mean PI upon
complexation, relative change in minimal PI upon complexation,
relative change in total ASA upon complexation,
side-chain RASA, relative change in polar ASA upon
complexation.

Figure 1 The importance of all 68 features (dataset1). Feature importance generated by random forests. The top-10 features were picked out
and various combinations were tested by 10-fold cross-validation to find the best feature subset for prediction of hot spots.

Figure 2 The importance of all 68 features (dataset2). Feature importance generated by random forests. The top-10 features were picked out
and various combinations were tested by 10-fold cross-validation to find the best feature subset for prediction of hot spots.

SemiBoost framework

Mallapragada et al. [38] presented a boosting framework
for semi-supervised learning, termed SemiBoost, which
improves supervised learning by using both labeled
and unlabeled data in the learning process. The framework
is given as follows.

Given a data set $D = \{x_1, x_2, x_3, \ldots, x_n\}$, the labels for
the entire dataset can be denoted as $y = [y_l; y_u]$, where the
labeled subset is denoted by $y_l = (y_1^l, y_2^l, \ldots, y_{n_l}^l)$ and the
unlabeled subset is denoted by $y_u = (y_1^u, y_2^u, \ldots, y_{n_u}^u)$,
with $n = n_l + n_u$. It can be assumed that an unlabeled
sample $x_u$ and the labeled sample with the highest similarity to
$x_u$ may share the same label. The matrix $S^{lu}$
represents the similarity between labeled and unlabeled
data. The term $F_l(y, S^{lu})$ stands for the inconsistency
between labeled and unlabeled data. It can also be
assumed that two unlabeled data points with the highest
similarity may share the same label. The symmetric
matrix $S^{uu}$ represents a similarity matrix based on the
unlabeled data. The term $F_u(y_u, S^{uu})$ stands for the
inconsistency among unlabeled data. Thus an objective
function $F(y, S)$ can be obtained from the above two terms.
Our goal is to find the label $y_u$ that minimizes $F(y, S)$.
Concretely, the objective function is given as



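$F(y, S) = F_l(y, S^{lu}) + C F_u(y_u, S^{uu})$    (1)

where C weights the importance between the labeled
and unlabeled data. The two terms in (1) are given as
follows:

$F_l(y, S^{lu}) = \sum_{i=1}^{n_l} \sum_{j=1}^{n_u} S^{lu}_{i,j} \exp(-2 y_i^l y_j^u)$,    (2)

$F_u(y_u, S^{uu}) = \sum_{i,j=1}^{n_u} S^{uu}_{i,j} \exp(y_i^u - y_j^u)$.    (3)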
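
The objective can be evaluated directly for small examples. The sketch below assumes the standard SemiBoost forms F_l = Σ_i Σ_j S^{lu}_{ij} exp(−2 y_i^l y_j^u) and F_u = Σ_{i,j} S^{uu}_{ij} exp(y_i^u − y_j^u), with labels in {−1, +1}; the function names and the nested-list matrix layout are illustrative choices.

```python
import math


def labeled_inconsistency(y_l, y_u, s_lu):
    """F_l: penalises unlabeled points that disagree with similar labeled points."""
    return sum(s_lu[i][j] * math.exp(-2.0 * y_l[i] * y_u[j])
               for i in range(len(y_l)) for j in range(len(y_u)))


def unlabeled_inconsistency(y_u, s_uu):
    """F_u: penalises similar unlabeled points that receive different labels."""
    n_u = len(y_u)
    return sum(s_uu[i][j] * math.exp(y_u[i] - y_u[j])
               for i in range(n_u) for j in range(n_u))


def objective(y_l, y_u, s_lu, s_uu, c):
    """F(y, S) = F_l + C * F_u."""
    return labeled_inconsistency(y_l, y_u, s_lu) + c * unlabeled_inconsistency(y_u, s_uu)
```

With one positive labeled sample that is similar to an unlabeled sample, assigning that unlabeled sample the label +1 gives a smaller objective than assigning −1, which is the behaviour the framework relies on.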

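Let $h^t(x)$ denote the classifier trained at the t-th iteration
by the underlying learning algorithm A, and let $H(x)$
denote the combined classifier. We have

$H(x) = \sum_{t=1}^{T} \alpha_t h^t(x)$    (4)

where $\alpha_t$ is the combination weight. Writing $H_i = H(x_i)$
and $h_i = h(x_i)$, the learning problem is transformed to the
following optimization problem:

$\arg\min_{h(x), \alpha} \sum_{i=1}^{n_l} \sum_{j=1}^{n_u} S_{i,j} \exp(-2 y_i^l (H_j + \alpha h_j)) + C \sum_{i,j=1}^{n_u} S_{i,j} \exp(H_i - H_j) \exp(\alpha (h_i - h_j))$    (5)
s.t. $h(x_i) = y_i^l, \; i = 1, \ldots, n_l$.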

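By variable substitution and regrouping, (5) can be
transformed into

$\bar{F}_1 = \sum_{i=1}^{n_u} \exp(-2 \alpha h_i) p_i + \exp(2 \alpha h_i) q_i$    (6)

where

$p_i = \sum_{j=1}^{n_l} S_{i,j} e^{-2 H_i} \delta(y_j, 1) + \frac{C}{2} \sum_{j=1}^{n_u} S_{i,j} e^{H_j - H_i}$,    (7)

$q_i = \sum_{j=1}^{n_l} S_{i,j} e^{2 H_i} \delta(y_j, -1) + \frac{C}{2} \sum_{j=1}^{n_u} S_{i,j} e^{H_i - H_j}$,    (8)

and $\delta(a, b)$ equals 1 if $a = b$ and 0 otherwise.
Above, $p_i$ and $q_i$ are considered as the confidences in
classifying the unlabeled data into the positive and negative
classes respectively.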

The SemiBoost algorithm starts with an empty ensemble.
At each iteration, it computes the confidence for the unlabeled
data and then assigns pseudo-labels according to
both the existing ensemble and the similarity matrix. The
most confident pseudo-labeled data are combined with
the labeled data to train a classifier using the supervised
learning algorithm. The ensemble classifier is updated by
adding the new classifier with an appropriate weight, and the
iteration is stopped when $\alpha < 0$, where

$\alpha = \frac{1}{4} \ln \frac{\sum_{i=1}^{n_u} p_i \delta(h_i, 1) + \sum_{i=1}^{n_u} q_i \delta(h_i, -1)}{\sum_{i=1}^{n_u} p_i \delta(h_i, -1) + \sum_{i=1}^{n_u} q_i \delta(h_i, 1)}$    (9)

and $\delta$ denotes the Kronecker delta.

Mallapragada et al. demonstrated that SemiBoost improves
the performance of supervised algorithms on different
datasets and outperforms the
benchmark semi-supervised algorithms [38].
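
The confidences and the combination weight of the original SemiBoost framework can be sketched as follows. The sketch assumes the published SemiBoost forms, namely p_i = Σ_j S_{ij} e^{−2H_i} δ(y_j, 1) + (C/2) Σ_j S_{ij} e^{H_j − H_i}, the mirrored expression for q_i, and α = (1/4) ln of the ratio of agreeing to disagreeing confidence mass. All function and argument names are hypothetical, and both sums inside the logarithm are assumed to be positive.

```python
import math


def confidences(h_ens, y_l, s_ul, s_uu, c):
    """Return (p, q): confidences that each unlabeled sample is positive / negative.

    h_ens[i] is the current ensemble output H(x_i) for unlabeled sample i,
    s_ul[i][j] its similarity to labeled sample j, s_uu[i][j] to unlabeled sample j.
    """
    n_u = len(h_ens)
    p, q = [], []
    for i in range(n_u):
        pos = sum(s for s, label in zip(s_ul[i], y_l) if label == 1)
        neg = sum(s for s, label in zip(s_ul[i], y_l) if label == -1)
        pull_up = sum(s_uu[i][j] * math.exp(h_ens[j] - h_ens[i]) for j in range(n_u))
        pull_down = sum(s_uu[i][j] * math.exp(h_ens[i] - h_ens[j]) for j in range(n_u))
        p.append(pos * math.exp(-2.0 * h_ens[i]) + c / 2.0 * pull_up)
        q.append(neg * math.exp(2.0 * h_ens[i]) + c / 2.0 * pull_down)
    return p, q


def combination_weight(p, q, h):
    """alpha for a new classifier whose predictions on the unlabeled data are h (+1/-1)."""
    agree = (sum(p_i for p_i, h_i in zip(p, h) if h_i == 1)
             + sum(q_i for q_i, h_i in zip(q, h) if h_i == -1))
    disagree = (sum(p_i for p_i, h_i in zip(p, h) if h_i == -1)
                + sum(q_i for q_i, h_i in zip(q, h) if h_i == 1))
    return 0.25 * math.log(agree / disagree)
```

A classifier that mostly agrees with the confidences receives a positive weight; one that mostly contradicts them yields α < 0, which is the stopping condition described above.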

SVM 

In this paper, we employed the support vector machine 
(SVM) as the underlying supervised learning algorithm 
in the SemiBoost framework. 

SVM was first developed by Vapnik [39] and was originally
employed to find a linear separating hyperplane
that maximizes the distance between two classes. SVM
can deal with problems that cannot be linearly
separated in the original input space by adding to the
optimization criterion a penalty for violations of the
constraints, or by transforming the input space into a
higher-dimensional space. It has been widely used for developing
methods in bioinformatics and has proved to
be effective in predicting hot spots [27,28,30].

sbSVM: an SVM with semi-supervised boosting to predict 
hot spots 

In this study, we propose a new method that combines the 
semi-supervised boosting framework with the underlying 
supervised learning algorithm SVM to predict hot spots. 

In the original SemiBoost framework proposed by
Mallapragada et al., both confidence values $p_i$ and $q_i$
might be large, and there is no persuasive criterion for
choosing the most confident unlabeled data. Directly
choosing the top 10% of the unlabeled data would include
too many ambiguous pseudo-labeled samples in the
early iterations.

In order to overcome the above problem, we modified the
terms in Equation (2) and Equation (3) by assigning weights
according to the similarity matrices $S^{ul}$ and $S^{uu}$ as follows:

. *E£iS#e*p(-2#H i + «h,)) 
arg mm 0 E 

h[x),a I=1 E S£j 

* E ; ""i exptH, - Hi) eK P (a(hj - h,))(10) 

+fH 

i 

s.t. h{xj)=y l yi=\,...,ni 

where <j> = 1/(1 + j) and \jr = C/(l + j). C is the tuning 
parameter for the relative importance of the labeled and unlabeled
data, and we set its default value to $n_l/n_u$. Given the above
function, we can obtain the values of $p_i$ and $q_i$ as follows:
1 +



2 j=l 



(„-2Hj 



y"s™e H '- H ',(l2) 



which have a maximum of 1. Then we sample
the unlabeled data according to the following two criteria:
(1) $|p_i - q_i| > 0.3$; (2) the top 10% of $|p_i - q_i|$. With
that, we can assign pseudo-labels to the unlabeled data
according to $\mathrm{sign}(p_i - q_i)$, and choose the most credible
ones for training the classifier.
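
The sampling rule can be sketched as below. The sketch assumes that both criteria must hold at once (a sample must rank in the top 10% by |p_i − q_i| and also exceed the 0.3 margin) and that at least one candidate is considered for very small unlabeled sets; both points are assumptions, as is the function name.

```python
def select_pseudo_labeled(p, q, min_margin=0.3, top_fraction=0.10):
    """Return (index, pseudo_label) pairs for the most confident unlabeled samples."""
    margins = [abs(p_i - q_i) for p_i, q_i in zip(p, q)]
    # Assumption: consider at least one candidate even for tiny unlabeled sets.
    k = max(1, int(len(margins) * top_fraction))
    ranked = sorted(range(len(margins)), key=lambda i: margins[i], reverse=True)[:k]
    # Pseudo-label is sign(p_i - q_i); only margins above the threshold are kept.
    return [(i, 1 if p[i] > q[i] else -1) for i in ranked if margins[i] > min_margin]
```

Samples whose two confidences are close are never selected, however few candidates remain, which addresses the ambiguity problem of taking the top 10% unconditionally.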

At each iteration, as in the original SemiBoost framework,
we update the ensemble classifier $H(x)$ with $H(x) + \alpha_t h^t(x)$.
The algorithm stops when the number of iterations
reaches T (a predefined parameter) or $\alpha < 0$. Figure 3
illustrates the basic workflow of the sbSVM approach. The
similarity matrices are calculated initially and play an
important role in selecting unlabeled samples. The unlabeled
data with the highest confidence are added to the
training set for the next iteration of training.

Performance evaluation 

To evaluate the classification performance of the
sbSVM method proposed in this study, we adopted
some widely used measures, including precision, recall
(sensitivity), specificity, accuracy and F1 score. These
measures are defined as follows:

$Precision = \frac{TP}{TP + FP}$,

$Recall\ (sensitivity) = \frac{TP}{TP + FN}$,

$Specificity = \frac{TN}{TN + FP}$,

$Accuracy = \frac{TP + TN}{TP + FP + TN + FN}$,

$F1 = 2 \times \frac{Precision \times Recall}{Precision + Recall}$.

Here, TP, FP, TN and FN denote the numbers of true
positives (correctly predicted hot spot residues), false
positives (non-hot spot residues incorrectly predicted as
hot spots), true negatives (correctly predicted non-hot
spot residues) and false negatives (hot spot residues
incorrectly predicted as non-hot spot residues), respectively.
The F1 score is a composite measure, which is widely
used to evaluate prediction accuracy considering both
precision and recall.
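
These measures follow directly from the four confusion-matrix counts, as the sketch below shows. The function name, the convention of returning 0.0 for an empty denominator and the example counts are all hypothetical; the counts are not taken from the reported results.

```python
def _ratio(numerator, denominator):
    """Safe division; returns 0.0 when the denominator is zero (assumed convention)."""
    return numerator / denominator if denominator else 0.0


def classification_metrics(tp, fp, tn, fn):
    """Precision, recall, specificity, accuracy and F1 from confusion-matrix counts."""
    precision = _ratio(tp, tp + fp)
    recall = _ratio(tp, tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "specificity": _ratio(tn, tn + fp),
        "accuracy": _ratio(tp + tn, tp + fp + tn + fn),
        "f1": _ratio(2 * precision * recall, precision + recall),
    }


# Hypothetical counts for a test set with 39 hot spots and 87 non-hot spots.
example = classification_metrics(tp=30, fp=35, tn=52, fn=9)
```

In this hypothetical example recall is high (30/39) while precision is modest (30/65), which illustrates why the composite F1 score is reported alongside the individual measures.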

Results and discussion 

Parameter selection 

The similarity matrices $S^{lu}$ and $S^{uu}$ are computed by the
radial basis function. For example, let $x_i$ and $x_j$ be two
samples from the dataset; the similarity between them is
calculated by $S_{ij} = \exp(-\|x_i - x_j\|^2 / 2\sigma^2)$, where $\sigma$ is the
scale parameter, which has a great impact on the performance
of the learning algorithm. We tested 10 values of
$\sigma$ from 1 to 10 in a 10-fold cross-validation on dataset1
to find the best performance of our method. The performance
of our method for each value of $\sigma$
is listed in Table 3. We chose $\sigma = 3$,
which produces the best performance. For dataset2,
our method has the best performance when $\sigma$ is set to 1.

The optimization process stops when $\alpha < 0$ during
the iterations. However, in order to avoid slow convergence,
we set the maximum number of iterations to T = 20.
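
A minimal sketch of the similarity computation follows, assuming the Gaussian form exp(−‖x_i − x_j‖² / (2σ²)); whether the denominator is 2σ² or σ² should be checked against the original definition. The function names are hypothetical and samples are plain lists of feature values.

```python
import math


def rbf_similarity(x_i, x_j, sigma):
    """Gaussian (radial basis function) similarity between two feature vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


def similarity_matrix(rows, cols, sigma):
    """Matrix of pairwise similarities between two sets of samples."""
    return [[rbf_similarity(r, c, sigma) for c in cols] for r in rows]
```

Calling `similarity_matrix(labeled, unlabeled, sigma)` gives the labeled-unlabeled block and `similarity_matrix(unlabeled, unlabeled, sigma)` the symmetric unlabeled block. A larger σ makes distant samples look more similar, which is why the choice of σ has a strong effect on which unlabeled samples are selected.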

Performance comparison and cross-validation 

In this section, the performance of sbSVM is examined
and compared with three existing machine learning
methods, including SVM [39], Bayes network [40] and
decision tree C4.5 [41]. We first conducted several
cross-validation (10/7/5/2-fold) tests and an additional
test called the random-20 test (where we randomly chose 20
samples from the training dataset to train the predictor
and then performed prediction on the remaining data;
this process was repeated 10 times to get the averaged
result) on dataset1 to show that the boosting with unlabeled
data method, sbSVM, outperforms the other three
methods. The experimental results (F1 scores) are
shown in Figure 4. From Figure 4, we can see that even
when the training data is small, sbSVM still outperforms
the others. As all the results of the decision tree are less
than 0.45, we do not show them in Figure 4.

Figure 3 The workflow of sbSVM. The labeled data is input and similarity matrices are calculated before the iteration. During each iteration,
some of the unlabeled data that have the highest classification confidence will be sampled into the training dataset for the next iteration.

Table 3 The performance of sbSVM when σ changes from
1 to 10 with stepsize = 1 (cross-validation on dataset1).

Our approach was further compared with five other
existing hot spot prediction methods by 10-fold cross-validation
on dataset1. The compared methods include
KFC [25], Robetta [21], FOLDEF [22], MINERVA [26]
and KFC2 [28].

The results of the compared methods were collected
from the original papers in which these methods were
published. All results are listed in Table 4. We can see
that sbSVM has the best recall of 0.82 among all these
methods, and its F1 score is outperformed only by
MINERVA. Besides, the specificity and accuracy of our
method are also competitive. Table 5 shows the results
of 10-fold cross-validation on dataset2. We can see that
our method has outstanding performance, with the
highest recall (0.89) and F1 score (0.80). Figure 5 illustrates
the ROC curves of our method on both datasets.
The areas under the curves are 0.764 (dataset1) and 0.719
(dataset2).

Table 4 The cross-validation results on dataset1.

Methods   Recall   Precision   Specificity   Accuracy   F1
KFC       0.55     0.58        0.85          0.78       0.56
Robetta   0.49     0.62        0.9           0.8        0.55
FOLDEF    0.32     0.59        0.93          0.78       0.41
MINERVA   0.58     0.73        0.89          0.82       0.65
sbSVM     0.82     0.5         0.74          0.76       0.62

Independent test 

Here we evaluate sbSVM and compare it with other
methods by independent test on ind-dataset, described
in the Methods section. The results are presented in
Table 6 and Table 7. Performance results of the compared
methods were obtained from their corresponding
web servers.

Table 6 shows that when our method sbSVM was
trained on dataset1 and tested on ind-dataset, we obtain
the highest recall (0.77) and F1 score (0.58).

Table 7 demonstrates that when our method was
trained on the balanced dataset dataset2 and tested on
ind-dataset, it still achieves the highest F1 score
(0.64), and its other measures, recall (0.72), specificity
(0.77) and accuracy (0.76), are still competitive among
all tested methods.

Figure 4 The comparison of different methods by cross-validation (10-fold, 7-fold, 5-fold, 2-fold and random-20 tests). Among all methods, sbSVM has the highest F1-score. sbSVM improves
the prediction performance even when the training dataset is small.

Table 5 The cross-validation results on dataset2.

Methods   Recall   Precision   Specificity   Accuracy   F1
KFC       0.55     0.81        0.88          0.70       0.66
Robetta   0.51     0.8         0.88          0.7        0.62
FOLDEF    0.31     0.8         0.93          0.62       0.44
MINERVA   0.58     0.93        0.96          0.77       0.72
KFC2      0.78     0.77        0.78          0.78       0.78
sbSVM     0.89     0.73        0.68          0.79       0.8

Table 6 Independent test results (sbSVM was trained on dataset1).

Methods   Recall   Precision   Specificity   Accuracy   F1
KFC       0.31     0.48        0.85          0.69       0.38
Robetta   0.33     0.52        0.87          0.71       0.4
FOLDEF    0.26     0.48        0.88          0.69       0.34
MINERVA   0.44     0.65        0.9           0.76       0.52
sbSVM     0.77     0.46        0.6           0.66       0.58

Remarks on the selected features 

In this paper, we extracted a large set of features from
previous studies, but only several were used in hot spot
prediction. The selected features for dataset1 and dataset2
are listed in Table 8. Note that none of the
sequence features were chosen in the two final feature
combinations for dataset1 and dataset2. This may imply
that general sequence information is not so important
in hot spot prediction.

The relative change in side-chain ASA upon complexation,
the relative change in total ASA upon complexation,
SA_RATIO5 and CORE_RIM measure, from different
aspects, the changes in the accessible surface of a residue
between the unbound and bound states. These structural features
were all chosen in our prediction, which suggests
that residues surrounded by others and sheltered from the
solvent are more likely to be hot spots [17]. Meanwhile, the
two relative changes in protrusion index (relative
change in side-chain mean PI upon complexation and
relative change in minimal PI upon complexation) used in
our method are also strong indicators of hot spots. It was
found that hot spots tend to protrude into complementary
pockets [17]. Therefore, these selected structural features
also suggest that the high local packing density of a residue
is helpful in predicting hot spots [42].

As the structural information used in this paper indicates
the nature of hot spots, our approach obtained the
highest recall in hot spot prediction.



Figure 5 ROC curves of sbSVM on dataset1 and dataset2 (x-axis: 1 − specificity). The areas under the curves are 0.764 (dataset1) and 0.758 (dataset2).

Table 7 Independent test results (sbSVM was trained on dataset2).

Methods   Recall   Precision   Specificity   Accuracy   F1
KFC       0.33     0.42        0.79          0.65       0.37
Robetta   0.39     0.58        0.87          0.72       0.46
FOLDEF    0.26     0.48        0.87          0.69       0.33
MINERVA   0.46     0.69        0.91          0.77       0.55
KFC2      0.74     0.56        0.74          0.74       0.64
sbSVM     0.82     0.51        0.64          0.70       0.63



Table 8 Selected features for dataset1 and dataset2.

Selected features for dataset1                              Selected features for dataset2
relative change in side-chain ASA upon complexation         SA_RATIO5
relative change in side-chain mean PI upon complexation     relative change in side-chain mean PI upon complexation
CORE_RIM                                                    relative change in minimal PI upon complexation
SA_RATIO5                                                   relative change in total ASA upon complexation
total RASA                                                  s-chain RASA
DELTA_TOT                                                   relative change in polar ASA upon complexation



Case study 

EPO (Erythropoietin) is produced by interstitial fibro- 
blasts in the kidney, which is in close association with 
peritubular capillary and tubular epithelial cells. It is the 
hormone that regulates red blood cell production. 

EMP1 (PDB ID: 1EBP, chain C) competes with EPO
to bind the erythropoietin
receptor (EPOR) (PDB ID: 1EBP, chain A) [43]. The experimentally
determined hot spots at the 1EBP A-C interface are F93A,
M150A, F205A and W13C, whereas T151A, L11C and T12C
were found experimentally to be non-hot spots (in BID).
Our method correctly predicts two of the four hot
spots, M150A and F205A, and all three non-hot
spots.

Figure 6(a) shows the experimental results on chain A 
of EMP1. Red color indicates the residues F93A, M150A 
and F205A, which were found to be hot spots. Figure 6 
(b) shows the prediction results of our method sbSVM 
on chain A. Here, red color shows the hot spots M150A 
and F205A. 







Figure 6 A case study. The visualization of prediction results on chain A
of EMP1. Red color indicates hot spots. (a) Physical experimental
results; (b) Computational results predicted by our method sbSVM.

Conclusions

In this study we proposed a new, effective computational
method, named sbSVM, to identify hot spots at protein
interfaces. We combined sequence and structure features,
and selected the most important features using random forests.
Our method is based on a semi-supervised boosting
framework that samples useful unlabeled data at
each iteration to improve the performance of the underlying
classifier (SVM in this paper). The performance of
sbSVM was evaluated by 10-fold cross-validation and independent
test. Results show that our approach, with the best
sensitivity and F1 score, provides better or at least comparable
performance relative to the major existing methods,
including KFC, Robetta, FOLDEF, MINERVA and
KFC2.

Our study has achieved a substantial improvement in the
performance of hot spot prediction by using unlabeled
data. In future work, we will, on the one hand, explore
more useful features of both hot spots and non-hot spots
and, on the other hand, try to develop more sophisticated
hot spot prediction methods based on advanced
machine learning techniques (e.g., transfer learning and
sparse representation).